RCAS

The RNA Centric Annotation System Analysis Report

About RCAS


RCAS (RNA Centric Annotation System) is an automated system that provides dynamic annotations for custom input files that contain transcriptomic target regions. Such target regions could be, for instance, peak regions from CLIP-Seq analysis, which detects protein-RNA interactions; peak regions from MeRIP-Seq analysis, which detects RNA modifications (also known as the epitranscriptome); or any other collection of target regions at the level of the transcriptome.

GO term enrichment analysis


RCAS overlays the input target regions with the annotated protein-coding genes and calculates the Gene Ontology (GO) terms that may be enriched or depleted in the input target regions compared to the background list of protein-coding genes. A classical Fisher's Exact Test is applied to each GO term, and the resulting p-values are corrected for multiple testing using both the False Discovery Rate and the Family-Wise Error Rate.
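To make the test concrete, here is a minimal sketch of per-term enrichment testing followed by both kinds of correction. It is not RCAS's own code. The function names are hypothetical, and several choices are assumptions that the report does not state: the test is two-sided (because the text mentions both enrichment and depletion), the False Discovery Rate is controlled with the Benjamini-Hochberg procedure, and the Family-Wise Error Rate with the Bonferroni procedure.

```python
from math import comb


def fisher_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell.
        return comb(row1, x) * comb(n - row1, col1 - x) / denom

    lo, hi = max(0, col1 - (n - row1)), min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are at most as likely as the observed one.
    total = sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def bonferroni(pvalues):
    """FWER control (assumed method): multiply each p-value by the number of tests."""
    m = len(pvalues)
    return [min(1.0, p * m) for p in pvalues]


def benjamini_hochberg(pvalues):
    """FDR control (assumed method): Benjamini-Hochberg adjusted p-values."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def go_enrichment(term_to_genes, target_genes, background_genes):
    """Test every GO term and return raw and corrected p-values per term."""
    background = set(background_genes)
    targets = set(target_genes) & background
    terms = list(term_to_genes)
    pvalues = []
    for term in terms:
        annotated = set(term_to_genes[term]) & background
        a = len(annotated & targets)     # targeted and annotated with the term
        b = len(targets) - a             # targeted, not annotated
        c = len(annotated) - a           # annotated, not targeted
        d = len(background) - a - b - c  # neither
        pvalues.append(fisher_two_sided(a, b, c, d))
    fdr, fwer = benjamini_hochberg(pvalues), bonferroni(pvalues)
    return [
        {"term": t, "p": p, "fdr": q, "fwer": w}
        for t, p, q, w in zip(terms, pvalues, fdr, fwer)
    ]
```

Here `target_genes` stands for the protein-coding genes overlapped by the input regions and `background_genes` for all annotated protein-coding genes. The Bonferroni value is never smaller than the Benjamini-Hochberg value for the same term, which is why reporting both gives a stricter and a more permissive view of the same result.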

Gene Set Enrichment Analysis


Similarly to the GO term enrichment analysis, RCAS also detects gene sets, as annotated in the Molecular Signatures Database (MSigDB), that are enriched or depleted in the queried target regions. Results are corrected for multiple testing using both the False Discovery Rate and the Family-Wise Error Rate.

1 Summary Figures


1.1 Distribution of query regions across gene features

Figure 1: The number of query regions that overlap each kind of gene feature is counted. The ‘y’ axis denotes the types of gene features included in the analysis and the ‘x’ axis denotes the percentage of query regions (out of the total number of query regions, denoted by “n”) that overlap at least one genomic interval hosting the corresponding feature. Note that the percentages for different features do not add up to 100%, because a query region may overlap multiple kinds of features. Query regions that do not overlap any gene feature are classified as “intergenic”.
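The counting rule in this caption can be sketched as follows. This is an illustration rather than RCAS's implementation: the function names are hypothetical, intervals are assumed to be half-open `(chrom, start, end)` tuples, and strand is ignored because the caption does not say whether it is used at this step.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open intervals [start, end) share at least one base."""
    return a_start < b_end and b_start < a_end


def feature_overlap_percentages(queries, features):
    """Percentage of query regions overlapping each feature type.

    queries:  list of (chrom, start, end)
    features: list of (chrom, start, end, feature_type)
    """
    counts = {}
    for q_chrom, q_start, q_end in queries:
        # Each feature type is counted at most once per query region.
        hit_types = {
            ftype
            for chrom, start, end, ftype in features
            if chrom == q_chrom and overlaps(q_start, q_end, start, end)
        }
        # A query region with no hit at all is labelled "intergenic".
        for ftype in hit_types or {"intergenic"}:
            counts[ftype] = counts.get(ftype, 0) + 1
    n = len(queries)
    return {ftype: 100.0 * count / n for ftype, count in counts.items()}
```

Because one query region can contribute to several feature types, the returned percentages can sum to more than 100, which matches the note in the caption.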

1.2 Distribution of query regions across RNA genes

Figure 2: The number of query regions that overlap each kind of RNA gene is counted. The ‘y’ axis denotes the types of RNA genes included in the analysis and the ‘x’ axis denotes the percentage of query regions (out of the total number of query regions, denoted by “n”) that overlap at least one genomic interval hosting the corresponding RNA gene type. Note that the percentages for different RNA genes do not add up to 100%, because a query region may overlap multiple kinds of RNA genes.

1.3 Distribution of query regions in the genome grouped by gene types

Figure 3: The number of query regions that overlap each gene type is counted. The ‘x’ axis denotes the types of genes included in the analysis and the ‘y’ axis denotes the percentage of query regions (out of the total number of query regions, denoted by “n”) that overlap at least one genomic interval hosting the corresponding gene type. Query regions that do not overlap any known gene are classified as “Unknown”.

1.4 Distribution of query regions across the chromosomes grouped by gene features

Figure 4: The number of query regions on each chromosome is counted. For each chromosome, the counts are further split into groups based on the gene features that the query regions overlap. The ‘x’ axis denotes the chromosomes included in the analysis and the ‘y’ axis denotes the frequency of overlaps.


1.5 Interactive table of genes that overlap query regions


2 Coverage Profiles

2.1 Coverage profile of query regions across the length of transcripts

Figure 6: The query regions are overlaid with the genomic coordinates of transcripts. Each transcript is divided into 100 bins of equal length and, for each bin, the number of query regions that cover the bin is counted. Transcripts shorter than 100 bp are excluded. The result is a coverage profile of the transcripts based on the distribution of the query regions. The strandedness of the transcripts is taken into account, and the coverage profile is plotted in the 5’ to 3’ direction.
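The binning procedure described in this caption might look like the sketch below. It is a simplified illustration, not RCAS's code. The function name is hypothetical, coordinates are assumed to be half-open, and a query region is assumed to "cover" a bin when it overlaps at least one base of that bin.

```python
def coverage_profile(targets, queries, n_bins=100, min_length=100):
    """Count query regions per relative-position bin, in 5' to 3' order.

    targets: list of (chrom, start, end, strand), e.g. transcripts
    queries: list of (chrom, start, end)
    """
    profile = [0] * n_bins
    for chrom, t_start, t_end, strand in targets:
        length = t_end - t_start
        if length < min_length:
            continue  # too short to be split into n_bins bins
        for q_chrom, q_start, q_end in queries:
            if q_chrom != chrom:
                continue
            ov_start, ov_end = max(t_start, q_start), min(t_end, q_end)
            if ov_start >= ov_end:
                continue  # no overlap with this target
            # Map the first and last overlapping base to bin indices.
            first = (ov_start - t_start) * n_bins // length
            last = (ov_end - 1 - t_start) * n_bins // length
            for b in range(first, last + 1):
                # On the minus strand the 5' end is at the genomic right,
                # so the bin order is reversed.
                idx = b if strand == "+" else n_bins - 1 - b
                profile[idx] += 1
    return profile
```

For a 1000 bp plus-strand transcript, a query region covering its first 20 bases increments bins 0 and 1. The same region on a minus-strand transcript increments bins 99 and 98, because those bases lie at the 3’ end of the transcript.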

2.2 Coverage profile of query regions across the length of Exons

Figure 7: The query regions are overlaid with the genomic coordinates of each exon of each transcript. Each exon is divided into 100 bins of equal length and, for each bin, the number of query regions that cover the bin is counted. Exons shorter than 100 bp are excluded. The result is a coverage profile of the exons based on the distribution of the query regions. The strandedness of the exons is taken into account, and the coverage profile is plotted in the 5’ to 3’ direction.

2.3 Coverage profile of query regions across the 100 bp region centered on exon-intron junctions

Figure 8: The query regions are overlaid with the genomic coordinates of each exon-intron junction of each transcript. Each junction comprises a 50 bp region of an exon and a 50 bp region of its neighboring intron. The exon-intron junctions are divided into 100 bins of equal length and, for each bin, the number of query regions that cover the bin is counted. Exons shorter than 100 bp are excluded. The result is a coverage profile of the exon-intron junctions based on the distribution of the query regions. The strandedness of the exons is taken into account, and the coverage profile is plotted in the 5’ to 3’ direction.
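One way to build such junction windows is sketched below. This is an assumption-laden illustration rather than RCAS's code: the function name is hypothetical, exon coordinates are assumed to be half-open, and introns shorter than 50 bp are not treated specially because the caption does not say how they are handled.

```python
def junction_windows(exons, strand, flank=50, min_exon_length=100):
    """Build windows centered on the exon-intron boundaries of one transcript.

    exons: list of (start, end) half-open genomic intervals of the transcript.
    Returns a list of (start, end, strand) windows, each 2 * flank bases wide.
    """
    exons = sorted(exons)
    windows = []
    for i, (start, end) in enumerate(exons):
        if end - start < min_exon_length:
            continue  # short exons are excluded, as stated in the caption
        if i > 0:
            # An intron lies to the genomic left of this exon.
            windows.append((start - flank, start + flank, strand))
        if i < len(exons) - 1:
            # An intron lies to the genomic right of this exon.
            windows.append((end - flank, end + flank, strand))
    return windows
```

With the default `flank=50`, every window is 100 bp wide, so dividing it into 100 bins yields one bin per base. The first and last exons contribute only one window each, since they have an intron on one side only.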

2.4 Coverage profile of query regions across the length of introns

Figure 9: The query regions are overlaid with the genomic coordinates of each intron of each transcript. Each intron is divided into 100 bins of equal length and, for each bin, the number of query regions that cover the bin is counted. Introns shorter than 100 bp are excluded. The result is a coverage profile of the introns based on the distribution of the query regions. The strandedness of the introns is taken into account, and the coverage profile is plotted in the 5’ to 3’ direction.

2.5 Coverage profile of query regions across the promoter regions

Figure 10: The query regions are overlaid with the genomic coordinates of the promoter region of each transcript. The promoter region is defined as the region spanning from 2000 bp upstream of the transcription start site to 200 bp downstream of it. Each promoter is divided into 100 bins of equal length and, for each bin, the number of query regions that cover the bin is counted. The result is a coverage profile of the promoters based on the distribution of the query regions. The strandedness of the promoters is taken into account, and the coverage profile is plotted in the 5’ to 3’ direction.
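The promoter definition depends on strand, because "upstream" means lower coordinates on the plus strand and higher coordinates on the minus strand. A minimal sketch follows. The function name is hypothetical, coordinates are assumed to be half-open, and clipping at position 0 is an added assumption.

```python
def promoter_region(tx_start, tx_end, strand, upstream=2000, downstream=200):
    """Return the (start, end) promoter interval of a transcript.

    The transcript occupies the half-open genomic interval [tx_start, tx_end).
    """
    if strand == "+":
        # The TSS is the first base of the interval.
        start, end = tx_start - upstream, tx_start + downstream
    else:
        # The TSS is the last base of the interval (tx_end - 1), so the
        # transcribed part lies to the left and upstream lies to the right.
        start, end = tx_end - downstream, tx_end + upstream
    return max(0, start), end  # assumption: clip at the chromosome start
```

For a transcript at `[10000, 20000)`, this gives `(8000, 10200)` on the plus strand and `(19800, 22000)` on the minus strand. Both are 2200 bp wide, so each of the 100 bins spans 22 bp.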

2.6 Coverage profile of query regions across the length of 5’ UTRs

Figure 11: The query regions are overlaid with the genomic coordinates of each 5’ UTR of each transcript. Each 5’ UTR is divided into 100 bins of equal length and, for each bin, the number of query regions that cover the bin is counted. The result is a coverage profile of the 5’ UTRs based on the distribution of the query regions. The strandedness of the 5’ UTRs is taken into account, and the coverage profile is plotted in the 5’ to 3’ direction.

2.7 Coverage profile of query regions across the length of 3’ UTRs

Figure 12: The query regions are overlaid with the genomic coordinates of each 3’ UTR of each transcript. Each 3’ UTR is divided into 100 bins of equal length and, for each bin, the number of query regions that cover the bin is counted. The result is a coverage profile of the 3’ UTRs based on the distribution of the query regions. The strandedness of the 3’ UTRs is taken into account, and the coverage profile is plotted in the 5’ to 3’ direction.


3 GO term and Pathway Enrichment Results

3.1 GO Term Enrichment Results for Biological Processes

3.2 GO Term Enrichment Results for Molecular Functions

3.3 GO Term Enrichment Results for Cellular Compartments

3.4 Gene Set Enrichment Results based on MSigDB

4 TOP MEME motifs discovered in the query regions

Figure 13: The genomic sequence covered by each query region is extracted from the FASTA file of the genome. MEME is then run to find enriched motif patterns in the query regions. The logos of the discovered motifs and the corresponding statistical test results are provided below.
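The sequence extraction step can be sketched as follows. This is not RCAS's code: the function names are hypothetical, the genome is assumed to be already loaded into a dictionary of chromosome sequences, coordinates are assumed to be half-open, and reverse-complementing minus-strand regions is an assumption that the caption does not confirm.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_sequences(genome, regions):
    """Cut the sequence of each query region out of the genome.

    genome:  dict mapping chromosome name to its sequence string
    regions: list of (chrom, start, end, strand), half-open coordinates
    """
    records = []
    for chrom, start, end, strand in regions:
        seq = genome[chrom][start:end]
        if strand == "-":
            # Assumption: report the sequence as read on the RNA's strand.
            seq = reverse_complement(seq)
        records.append((f"{chrom}:{start}-{end}({strand})", seq))
    return records


def to_fasta(records):
    """Format (name, sequence) pairs as FASTA text for a motif finder."""
    return "".join(f">{name}\n{seq}\n" for name, seq in records)
```

The resulting FASTA text is the kind of input a motif discovery tool such as MEME works on. How MEME itself is invoked is not described in this report and is left out of the sketch.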

MOTIF 1 MEME width = 8 sites = 517 llr = 2914 E-value = 4.1e-103

MOTIF 2 MEME width = 8 sites = 108 llr = 897 E-value = 7.0e-035

MOTIF 3 MEME width = 8 sites = 38 llr = 342 E-value = 1.9e+004

4.1 Discovered consensus motifs and their frequency in the target transcriptome

Figure 14: The frequency of the top 10 discovered motifs in the transcriptome is plotted.

4.2 Top genes with most discovered types of motifs

4.3 Distribution of discovered motifs in gene types

Figure 15: The frequency of the top 10 discovered motifs in the transcriptome is plotted with respect to different types of genes.

4.4 Distribution of discovered motifs in gene features

Figure 16: The frequency of the top 10 discovered motifs in the transcriptome is plotted with respect to different types of gene features.

5 Acknowledgements

RCAS is developed by Dr. Altuna Akalin (head of the Scientific Bioinformatics Platform), Dr. Dilmurat Yusuf (Bioinformatics Scientist), and Dr. Bora Uyar (Bioinformatics Scientist) at the Berlin Institute of Medical Systems Biology (BIMSB) at the Max-Delbrueck-Center for Molecular Medicine (MDC) in Berlin.

RCAS is developed as a bioinformatics service as part of the RNA Bioinformatics Center, which is one of the eight centers of the German Network for Bioinformatics Infrastructure (de.NBI).